Key cache by HMAC of mapping content, auto-reload on mtime change #36

Merged
jdoss merged 1 commit into master from fix/cache-hmac-key-and-auto-reload
Apr 17, 2026

jdoss commented Apr 17, 2026

Summary

The cache was keyed by Podman's hex secret ID, which Podman regenerates every time setup runs its delete+create cycle. This made the cache silently useless after every refresh timer fire: serve's in-memory dict held the old IDs while Podman handed out new ones, so the cache miss rate was 100% until serve was restarted manually.

Observed on the test server: in the 30 minutes after a refresh fire, all 1554 lookups fell through to the provider. When Infisical then went down, every container secret lookup failed with 502, which is the exact outage the cache was built to prevent.

Root-cause fix

  • Cache keys are the HMAC-SHA256 of the mapping's canonical JSON bytes, not the Podman hex ID. The same mapping always yields the same key, no matter how often Podman churns the hex IDs. The HMAC key is random, per-host, and stored inside the encrypted cache envelope, so mapping hashes cannot be correlated across deployments.
  • Serve calls cache.maybe_reload() on the lookup hot path. This costs one stat() per request (~1μs); the file is reloaded only when setup has rewritten it. Rotations propagate to serve without a restart.
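
For reviewers who want the key derivation spelled out, here is a minimal sketch of a content-derived key. It is not the code in this PR: the `canonical_bytes` helper, the `cache_key(hmac_key, mapping)` signature, and the mapping fields are assumptions made for illustration, and the real HMAC key lives inside the encrypted envelope rather than in a local variable.

```python
import hashlib
import hmac
import json
import secrets


def canonical_bytes(mapping: dict) -> bytes:
    # Sorted keys and fixed separators, so equal mappings serialize identically.
    return json.dumps(mapping, sort_keys=True, separators=(",", ":")).encode("utf-8")


def cache_key(hmac_key: bytes, mapping: dict) -> str:
    # HMAC-SHA256 over the canonical bytes; the hex digest is the cache key.
    return hmac.new(hmac_key, canonical_bytes(mapping), hashlib.sha256).hexdigest()


# Hypothetical per-host key and mapping, for illustration only.
host_key = secrets.token_bytes(32)
mapping = {"provider": "infisical", "path": "/app", "name": "DB_PASSWORD"}
reordered = {"name": "DB_PASSWORD", "path": "/app", "provider": "infisical"}

# Same content gives the same key, regardless of dict ordering or Podman's hex ID.
assert cache_key(host_key, mapping) == cache_key(host_key, reordered)
# A different host key gives an unrelated digest for the same mapping.
assert cache_key(host_key, mapping) != cache_key(secrets.token_bytes(32), mapping)
```

Because the key depends only on the mapping content and the per-host secret, a delete+create cycle that changes Podman's hex ID leaves the cache key untouched.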

Cleanups enabled by the new design

  • Drop _prune_stale_cache_entries from setup — no stale entries to prune when keys are content-derived.
  • Drop the id_map return from _register_secrets — cache doesn't need hex IDs any more.
  • Drop ExecStart=systemctl try-restart psi-secrets.service from the refresh wrapper — auto-reload handles it.
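
The auto-reload that makes the try-restart unnecessary can be sketched roughly as below. This is an illustration under assumptions, not the PR's implementation: `ReloadingCache`, its `key=value` file format, and the `reloads` counter are hypothetical stand-ins, and the real cache decrypts an envelope instead of parsing plain text.

```python
from __future__ import annotations

import os


class ReloadingCache:
    """Hypothetical stand-in for the cache object used by serve."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.entries: dict[str, str] = {}
        self._mtime_ns: int | None = None
        self.reloads = 0

    def _load(self) -> dict[str, str]:
        # Placeholder parser for "key=value" lines; the real cache decrypts instead.
        with open(self.path, encoding="utf-8") as fh:
            return dict(line.rstrip("\n").split("=", 1) for line in fh if "=" in line)

    def maybe_reload(self) -> bool:
        """Reload only if the file's mtime changed; return True if a reload happened."""
        try:
            mtime_ns = os.stat(self.path).st_mtime_ns
        except FileNotFoundError:
            return False  # Missing file: keep serving what is already in memory.
        if mtime_ns == self._mtime_ns:
            return False  # Unchanged: the stat() was the only cost.
        self.entries = self._load()
        self._mtime_ns = mtime_ns
        self.reloads += 1
        return True
```

Calling `maybe_reload()` at the top of each lookup keeps the common path to a single `stat()`, while a rewrite by another process is picked up on the next request.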

Migration

Legacy v1 payloads (hex-ID keyed) are discarded on load; next save rewrites in v2 format with a freshly generated HMAC key. Container lookups during the one-time transition fall through to the provider exactly once.
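
The discard-on-load behaviour might look something like the following sketch. The `version`, `hmac_key`, and `entries` field names and the `load_payload` function are assumptions for illustration; the sketch also works on already-decrypted JSON bytes, whereas the real payload sits inside the encrypted envelope.

```python
import json
import secrets


def load_payload(raw: bytes) -> dict:
    """Return a v2 payload, discarding anything that is not v2."""
    try:
        payload = json.loads(raw)
    except ValueError:
        payload = None  # Unreadable payloads are treated like legacy ones.
    if isinstance(payload, dict) and payload.get("version") == 2:
        return payload
    # Legacy v1 (hex-ID keyed) or unreadable: start empty with a fresh HMAC key.
    return {"version": 2, "hmac_key": secrets.token_hex(32), "entries": {}}
```

Starting from an empty `entries` dict is what makes the first lookup after the upgrade fall through to the provider; the next save then persists the v2 format.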

Test plan

  • pytest — 344 tests pass.
    • New TestCacheKey: HMAC stable across save/load, different mappings produce different keys, per-host keys prevent cross-correlation.
    • New TestMaybeReload: no reload when mtime unchanged, reload when another writer updates the file, graceful handling of missing file.
    • New TestLegacyV1PayloadDiscarded: v1 payloads ignored cleanly with fresh HMAC key.
    • Updated test_serve_offline.py to use cache.cache_key() for key construction.
  • ruff check / ruff format --check / ty check — clean.
  • Deploy to test server, confirm cache survives refresh timer fires and Infisical outages without serve restart.

@jdoss jdoss merged commit ffca21d into master Apr 17, 2026
2 checks passed